from google.colab import drive
drive.mount('/content/drive/')
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
#replace with your own path for the folder
project_path = '/content/drive/MyDrive/AI Lab'
Mounted at /content/drive/
INTRODUCTION¶
Typically, cells grow near blood vessels that provide the oxygen they need to function. Cancer cells, however, multiply quickly, creating crowded environments near these vessels and resulting in oxygen deprivation (hypoxia). In such conditions, some cells die, while others develop mechanisms to survive, making them resistant to treatments and more aggressive. This project aims to predict whether a cell is in a hypoxic or normoxic state by examining its gene expression.
The study analyzes single-cell RNA sequencing data from two cancer cell lines (HCC1806 and MCF7) obtained with two techniques, SmartSeq and DropSeq. SmartSeq is a plate-based method that allows for a comprehensive analysis of the transcriptome of individual cells. DropSeq is a droplet-based (emulsion) system: cells are encapsulated in tiny droplets together with barcoded microbeads, and within each droplet the cell's mRNA binds to the barcoded oligonucleotides on the bead. The barcodes make it possible to sequence the pooled material and assign each read back to its cell of origin, yielding gene expression profiles for thousands of individual cells. Hypoxic cells were exposed to low oxygen levels (around 1%), while the others were kept in normal conditions.
Because DNA is easier to handle than RNA, the RNA is first converted into DNA before sequencing, and the whole process involves several steps and specialized machinery. The sequencing machine first produces a file in a vendor-specific, non-textual format. This file is converted into a FASTQ file, which contains the sequence of bases together with their quality scores encoded as ASCII characters. Finally, the reads are aligned against a reference to produce SAM/BAM files, which contain the aligned sequence reads.
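To make the FASTQ format more concrete, the sketch below parses a single four-line record and decodes its quality string. The record itself is invented for illustration, the helper name `parse_fastq_record` is hypothetical, and the Phred+33 offset is an assumption (it is the common encoding, but older files may use a different offset).

```python
# Hypothetical FASTQ record (four lines), invented for illustration only.
record = """@read_1
GATTACA
+
IIIIH#5
"""


def parse_fastq_record(text, offset=33):
    """Split one 4-line FASTQ record and decode its quality string.

    Assumes Phred+33 encoding: quality score = ord(character) - 33.
    """
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, scores


name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
```

Each base gets exactly one quality character, which is why the sequence and quality lines must have the same length.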
LIBRARIES AND OPENING THE FILES¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, log_loss, f1_score, mean_squared_error
# mcf smartseq
smart_mcf_meta = pd.read_csv(project_path + '/AILab24/SmartSeq/MCF7_SmartS_MetaData.tsv', sep='\t')
smart_mcf_unfiltered = pd.read_csv(project_path + "/AILab24/SmartSeq/MCF7_SmartS_Unfiltered_Data.txt", sep=' ')
smart_mcf_filtered = pd.read_csv(project_path + "/AILab24/SmartSeq/MCF7_SmartS_Filtered_Data.txt")
smart_mcf = pd.read_csv(project_path + '/AILab24/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt', sep=' ')
# hcc smartseq
smart_hcc_meta = pd.read_csv(project_path + '/AILab24/SmartSeq/HCC1806_SmartS_MetaData.tsv', sep='\t')
smart_hcc_unfiltered = pd.read_csv(project_path + "/AILab24/SmartSeq/HCC1806_SmartS_Unfiltered_Data.txt", sep=' ')
smart_hcc_filtered = pd.read_csv(project_path + "/AILab24/SmartSeq/HCC1806_SmartS_Filtered_Data.txt")
smart_hcc = pd.read_csv(project_path + '/AILab24/SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_train.txt', sep=' ')
# mcf dropseq
drop_mcf = pd.read_csv(project_path + "/AILab24/DropSeq/MCF7_Filtered_Normalised_3000_Data_train.txt", sep=' ')
# hcc dropseq
drop_hcc = pd.read_csv(project_path + "/AILab24/DropSeq/HCC1806_Filtered_Normalised_3000_Data_train.txt", sep=' ')
def rescaling(df):
    # Apply log2(1 + x) element-wise; the +1 keeps zero counts finite
    df1 = df + 1
    df1_log2 = df1.apply(np.log2)
    return df1_log2
EXPLORATORY DATA ANALYSIS¶
Analysis of metadata¶
#N. of normo and hypo in HCC1806
normo = sum('Normoxia' in cell for cell in smart_hcc_meta.columns)
hypo = sum('Hypoxia' in cell for cell in smart_hcc_meta.columns)
print('The number of normo cells in HCC1806 with SMARTSEQ:', normo)
print('The number of hypo cells in HCC1806 with SMARTSEQ:', hypo)
print(normo + hypo == 243)
#N. of normo and hypo in MCF7
normo = sum('Normoxia' in cell for cell in smart_mcf_meta.columns)
hypo = sum('Hypoxia' in cell for cell in smart_mcf_meta.columns)
print('The number of normo cells in MCF7 with SMARTSEQ:', normo)
print('The number of hypo cells in MCF7 with SMARTSEQ:', hypo)
print(normo + hypo == 383)
In both cell lines, cells in a state of normoxia and those in a state of hypoxia are split almost 50/50.
print("PCR plate: ", smart_hcc_meta.iloc[ : , 1])
ones = 0
twos = 0
threes = 0
fours = 0
for plate in smart_hcc_meta.iloc[ : , 1]:
    if str(plate) == '1':
        ones += 1
    elif str(plate) == '2':
        twos += 1
    elif str(plate) == '3':
        threes += 1
    elif str(plate) == '4':
        fours += 1
print('The number of cells on PCR plate 1 in HCC with SMARTSEQ:', ones)
print('The number of cells on PCR plate 2 in HCC with SMARTSEQ:', twos)
print('The number of cells on PCR plate 3 in HCC with SMARTSEQ:', threes)
print('The number of cells on PCR plate 4 in HCC with SMARTSEQ:', fours)
print(ones + twos + threes + fours == 243)
#N. of lanes in MCF7
print("Lane: ", smart_mcf_meta.iloc[ : , 1])
ones = 0
twos = 0
threes = 0
fours = 0
for lane in smart_mcf_meta.iloc[ : , 1]:
    if str(lane) == 'output.STAR.1':
        ones += 1
    elif str(lane) == 'output.STAR.2':
        twos += 1
    elif str(lane) == 'output.STAR.3':
        threes += 1
    elif str(lane) == 'output.STAR.4':
        fours += 1
print('The number of cells in lane output.STAR.1 in MCF with SMARTSEQ:', ones)
print('The number of cells in lane output.STAR.2 in MCF with SMARTSEQ:', twos)
print('The number of cells in lane output.STAR.3 in MCF with SMARTSEQ:', threes)
print('The number of cells in lane output.STAR.4 in MCF with SMARTSEQ:', fours)
print(ones + twos + threes + fours == 383)
In both cell lines, cells are fairly evenly distributed across PCR plates (HCC) and lanes (MCF), except that PCR plate 4 in HCC holds fewer cells (48).
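Tallies like the ones above can also be computed with pandas `value_counts`, which avoids keeping one counter per category. The sketch below runs on a toy metadata table: the column names `PCR Plate` and `Condition` and all the values are hypothetical and are not taken from the project files.

```python
import pandas as pd

# Toy metadata: hypothetical column names and values, for illustration only.
meta = pd.DataFrame({
    "PCR Plate": [1, 1, 2, 3, 3, 4],
    "Condition": ["Hypo", "Norm", "Hypo", "Norm", "Hypo", "Norm"],
})

# Number of cells per plate, sorted by plate label.
plate_counts = meta["PCR Plate"].value_counts().sort_index()

# Fraction of cells in each condition.
condition_share = meta["Condition"].value_counts(normalize=True)

print(plate_counts)
print(condition_share)

# Every cell is counted exactly once.
assert plate_counts.sum() == len(meta)
```

Passing `normalize=True` returns proportions instead of counts, which is a direct way to check how balanced the two conditions are.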
#are all hours 24? HCC
print((smart_hcc_meta.iloc[:, 4] != 24).sum() == 0)
#are all hours 72? MCF
print((smart_mcf_meta.iloc[:, 4] != 72).sum() == 0)
Analysis of DropSeq¶
#Conditions in DropSeq HCC
normo = sum('Normoxia' in cell for cell in drop_hcc.columns)
hypo = sum('Hypoxia' in cell for cell in drop_hcc.columns)
print(normo, hypo, normo + hypo == 14682)
#Conditions in DropSeq MCF
normo = sum('Normoxia' in cell for cell in drop_mcf.columns)
hypo = sum('Hypoxia' in cell for cell in drop_mcf.columns)
print(normo, hypo, normo + hypo == 21626)
Analysis of sparsity¶
#Zeros in MCF Unfiltered SmartSeq
# Keep the original dtype: an int16 cast would wrap counts above 32767
array = smart_mcf_unfiltered.to_numpy()
zeros = np.sum(array == 0)
sparsity = zeros / array.size
print(zeros, sparsity, array.size)
#Zeros in HCC Unfiltered SmartSeq
# Keep the original dtype: an int16 cast would wrap counts above 32767
array = smart_hcc_unfiltered.to_numpy()
zeros = np.sum(array == 0)
sparsity = zeros / array.size
print(zeros, sparsity, array.size)
Violinplots¶
smart_mcf_unfiltered.head()
| | output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam | output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam | output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WASH7P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| MIR6859-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WASH9P | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 5 |
| OR4F29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| MTND1P23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 383 columns
print(smart_mcf_unfiltered.shape)
smart_mcf_unfiltered.describe()
(22934, 383)
| | output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam | output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam | output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | ... | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 |
| mean | 40.817651 | 0.012253 | 86.442400 | 1.024636 | 14.531351 | 56.213613 | 75.397183 | 62.767725 | 67.396747 | 2.240734 | ... | 17.362562 | 42.080230 | 34.692422 | 32.735284 | 21.992718 | 17.439391 | 49.242784 | 61.545609 | 68.289352 | 62.851400 |
| std | 465.709940 | 0.207726 | 1036.572689 | 6.097362 | 123.800530 | 503.599145 | 430.471519 | 520.167576 | 459.689019 | 25.449630 | ... | 193.153757 | 256.775704 | 679.960908 | 300.291051 | 153.441647 | 198.179666 | 359.337479 | 540.847355 | 636.892085 | 785.670341 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 17.000000 | 0.000000 | 5.000000 | 0.000000 | 7.000000 | 23.000000 | 39.000000 | 35.000000 | 38.000000 | 1.000000 | ... | 9.000000 | 30.000000 | 0.000000 | 17.000000 | 12.000000 | 9.000000 | 27.000000 | 30.000000 | 38.000000 | 33.000000 |
| max | 46744.000000 | 14.000000 | 82047.000000 | 289.000000 | 10582.000000 | 46856.000000 | 29534.000000 | 50972.000000 | 36236.000000 | 1707.000000 | ... | 17800.000000 | 23355.000000 | 81952.000000 | 29540.000000 | 12149.000000 | 19285.000000 | 28021.000000 | 40708.000000 | 46261.000000 | 68790.000000 |
8 rows × 383 columns
print(smart_mcf_unfiltered.iloc[:, 0])
WASH7P 0
MIR6859-1 0
WASH9P 1
OR4F29 0
MTND1P23 0
...
MT-TE 4
MT-CYB 270
MT-TT 0
MT-TP 5
MAFIP 8
Name: output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam, Length: 22934, dtype: int64
sns.violinplot(x = smart_mcf_unfiltered.iloc[:, 0])
plt.title('Gene Expression profile of the first cell')
plt.show()
df_reduced = smart_mcf_unfiltered.iloc[:, :50]
plt.figure(figsize = (16, 6))
sns.violinplot(data = df_reduced, palette = 'Set3', cut = 0)
plt.xticks(rotation = 90)
plt.title("Distributions of a sample of gene expression profiles")
plt.show()
sns.violinplot(x = smart_mcf_unfiltered.iloc[0, :])
plt.title('Distributions of the expression levels across cells for the first gene')
plt.show()
df_reduced = smart_mcf_unfiltered.iloc[:50, :].T
plt.figure(figsize = (16, 6))
sns.violinplot(data = df_reduced, palette = 'Set3', cut = 0)
plt.xticks(rotation = 90)
plt.title("Distributions of the expression levels across cells for a sample of genes")
plt.show()
We notice that the data is sparse and that some counts are far larger than others.
Skewness and kurtosis filtered vs unfiltered¶
from scipy.stats import kurtosis, skew
def plot_skewness(df):
    # Skewness of each cell's expression profile (one value per column)
    df_skew_cells = [skew(df[col]) for col in df.columns]
    plt.figure()
    sns.histplot(df_skew_cells, bins = 100)
    plt.xlabel('Skewness of single cells expression profiles')
    plt.show()
plot_skewness(smart_mcf_unfiltered)
def plot_kurtosis(df):
    # Excess kurtosis of each cell's expression profile (one value per column)
    df_kurt_cells = [kurtosis(df[col]) for col in df.columns]
    plt.figure()
    sns.histplot(df_kurt_cells, bins = 100)
    plt.xlabel('Kurtosis of single cells expression profiles')
    plt.show()
plot_kurtosis(smart_mcf_unfiltered)
The expression profiles of the unfiltered data are highly non-normal: they are skewed and heavy-tailed.
plot_skewness(smart_mcf_filtered)
plot_kurtosis(smart_mcf_filtered)
Kurtosis and skewness are not affected by the filtering
plot_skewness(smart_mcf)
plot_kurtosis(smart_mcf)
Normalization affects skewness and kurtosis but does not make them vanish.
plot_skewness(rescaling(smart_mcf_unfiltered))
plot_kurtosis(rescaling(smart_mcf_unfiltered))
plot_skewness(rescaling(smart_mcf_filtered))
plot_kurtosis(rescaling(smart_mcf_filtered))
Here the kurtosis can be negative because scipy.stats.kurtosis returns the excess kurtosis by default (Fisher's definition): the kurtosis of a Gaussian, which is 3, is subtracted from the result.
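The effect of this convention can be checked on synthetic samples. The sketch below is only an illustration: the random data is made up and unrelated to the project files. It compares the default (Fisher) output of `scipy.stats.kurtosis` with the Pearson version obtained through `fisher=False`.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)   # synthetic Gaussian sample
uniform = rng.uniform(size=100_000)   # synthetic light-tailed sample

# Default: Fisher's definition, i.e. excess kurtosis (Gaussian -> about 0)
excess = kurtosis(gaussian)
# Pearson's definition, without subtracting 3 (Gaussian -> about 3)
pearson = kurtosis(gaussian, fisher=False)
# A distribution with lighter tails than a Gaussian has negative excess kurtosis
uniform_excess = kurtosis(uniform)

print(excess, pearson, uniform_excess)
```

A negative value therefore only means that the tails are lighter than those of a Gaussian, not that the computation went wrong.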
plot_skewness(rescaling(smart_mcf))
plot_kurtosis(rescaling(smart_mcf))
We notice that the log2-rescaling of the data largely removes the skewness and the heavy tails.
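The same effect can be reproduced on synthetic data. The sketch below draws heavy-tailed counts from a log-normal distribution (an assumption made purely for illustration, not a model of the project data) and compares the skewness before and after the log2(1 + x) transformation used in `rescaling`.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "counts": hypothetical data, not from the project files.
counts = np.floor(rng.lognormal(mean=3.0, sigma=1.5, size=10_000))

raw_skew = skew(counts)                  # skewness of the raw counts
log_skew = skew(np.log2(counts + 1))     # skewness after log2(1 + x)

print(raw_skew, log_skew)
```

On real expression data the improvement depends on the fraction of zeros, since log2(1 + 0) = 0 leaves a spike at zero that the transformation cannot spread out.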
Normalization¶
sns.displot(data = smart_mcf_unfiltered.iloc[:, :30], palette = "Set3", kind = "kde", bw_adjust = 2)
sns.displot(data = rescaling(smart_mcf_unfiltered).iloc[:, :30], palette = "Set3", kind = "kde", bw_adjust = 2)
sns.displot(data = smart_mcf_filtered.iloc[:, :30], palette = "Set3", kind = "kde", bw_adjust = 2)
sns.displot(data = rescaling(smart_mcf_filtered).iloc[:, :30], palette = "Set3", kind = "kde", bw_adjust = 2)
sns.displot(data = smart_mcf.iloc[:, :30], palette = "Set3", kind = "kde", bw_adjust = 2)
sns.displot(data = rescaling(smart_mcf).iloc[:, :30], palette="Set3", kind="kde", bw_adjust=2)
With normalization alone, the data remains concentrated around a very sharp peak, whereas log2-rescaling smooths the distributions out more successfully.
Mean - variance plots¶
def mean_var_plot(df):
    sns.set_style("white")
    fig = plt.figure(figsize=(8, 5))
    logged = np.log2(1 + df)
    # Genes are rows, so per-gene statistics aggregate across cells (axis=1)
    variance = logged.var(axis=1)
    mean = logged.mean(axis=1)
    plt.scatter(mean, variance, alpha=0.5, c="r")
    plt.xlabel("Mean")
    plt.ylabel("Variance")
    plt.title("Mean vs variance of genes - logged data", fontsize=15)
mean_var_plot(smart_mcf_filtered)
mean_var_plot(smart_mcf_unfiltered)
mean_var_plot(smart_mcf)
from sklearn.linear_model import LinearRegression
def mean_var_plot(df):
    # Note: log2(1 + x) is applied here, so data already passed through
    # rescaling() is transformed a second time.
    sns.set_style("white")
    fig = plt.figure(figsize=(6, 4))
    logged = np.log2(1 + df)
    # Genes are rows, so per-gene statistics aggregate across cells (axis=1)
    variance = logged.var(axis=1)
    mean = logged.mean(axis=1)
    x = mean.values.reshape(-1, 1)
    y = variance.values.reshape(-1, 1)
    # Fit linear regression model
    model = LinearRegression()
    model.fit(x, y)
    # Predict y values based on the linear regression model
    y_pred = model.predict(x)
    plt.scatter(mean, variance, alpha=0.5, c="b")
    plt.plot(mean, y_pred, color='red', label='Line of Best Fit')
    plt.legend()
    plt.xlabel("Mean")
    plt.ylabel("Variance")
    plt.title("Mean vs variance of genes - logged data", fontsize=15)
# Calculate the percentage of zeros in each column
sparsity_unf_gene = (smart_mcf_unfiltered.T == 0).sum() / len(smart_mcf_unfiltered.T) * 100
sparsity_unf_gene.sort_values()
KRT8 1.305483
ACTB 1.566580
ALDOA 1.566580
KRT18 1.566580
GAPDH 1.566580
...
PECAM1 99.477807
RPL23AP23 99.477807
RPL26P29 99.477807
BEND7P1 99.477807
SLIT3 99.477807
Length: 22934, dtype: float64
dense_columns = sparsity_unf_gene[sparsity_unf_gene[:] <= 98.236]
filtered_df_gene = smart_mcf_unfiltered.T[dense_columns.index]
print("Filtered DataFrame shape:", filtered_df_gene.shape)
print("Unfiltered DataFrame shape:", smart_mcf_unfiltered.T.shape)
print("Given filtered DataFrame shape:", smart_mcf_filtered.T.shape)
Filtered DataFrame shape: (383, 18714)
Unfiltered DataFrame shape: (383, 22934)
Given filtered DataFrame shape: (313, 18945)
df_smart_mcf_filtered_sp = filtered_df_gene.T
df_trans_sp = rescaling(df_smart_mcf_filtered_sp)
mean_var_plot(df_trans_sp)
mean_var_plot(rescaling(smart_mcf_unfiltered))
mean_var_plot(smart_mcf_unfiltered)
mean_var_plot(rescaling(smart_mcf_filtered))
mean_var_plot(smart_mcf_filtered)
mean_var_plot(rescaling(smart_mcf))
mean_var_plot(smart_mcf)
Duplicate rows¶
duplicate_rows_df = df_trans_sp[df_trans_sp.duplicated(keep=False)]
print("shape of the duplicate rows DataFrame: ", duplicate_rows_df.shape)
print("duplicate rows:\n", duplicate_rows_df)
shape of the duplicate rows DataFrame: (10, 383)
duplicate rows:
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam \
UGT1A8 5.857981
UGT1A9 5.857981
PANDAR 0.000000
LAP3P2 0.000000
SUGT1P4-STRA6LP 0.000000
STRA6LP 0.000000
LINC00856 0.000000
LINC00595 0.000000
CCL3L3 0.000000
CCL3L1 0.000000
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 1.0
LAP3P2 1.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam \
UGT1A8 0.000000
UGT1A9 0.000000
PANDAR 0.000000
LAP3P2 0.000000
SUGT1P4-STRA6LP 1.584963
STRA6LP 1.584963
LINC00856 0.000000
LINC00595 0.000000
CCL3L3 0.000000
CCL3L1 0.000000
output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam ... \
UGT1A8 0.0 ...
UGT1A9 0.0 ...
PANDAR 0.0 ...
LAP3P2 0.0 ...
SUGT1P4-STRA6LP 0.0 ...
STRA6LP 0.0 ...
LINC00856 0.0 ...
LINC00595 0.0 ...
CCL3L3 0.0 ...
CCL3L1 0.0 ...
output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 1.0
LAP3P2 1.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam \
UGT1A8 2.584963
UGT1A9 2.584963
PANDAR 0.000000
LAP3P2 0.000000
SUGT1P4-STRA6LP 4.321928
STRA6LP 4.321928
LINC00856 0.000000
LINC00595 0.000000
CCL3L3 0.000000
CCL3L1 0.000000
output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 2.0
CCL3L1 2.0
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 1.0
LAP3P2 1.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 3.0
LINC00595 3.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam \
UGT1A8 0.0
UGT1A9 0.0
PANDAR 1.0
LAP3P2 1.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
UGT1A8 0.0
UGT1A9 0.0
PANDAR 0.0
LAP3P2 0.0
SUGT1P4-STRA6LP 0.0
STRA6LP 0.0
LINC00856 0.0
LINC00595 0.0
CCL3L3 0.0
CCL3L1 0.0
[10 rows x 383 columns]
print("names of duplicate rows: ", duplicate_rows_df.index)
duplicate_rows_df_t = duplicate_rows_df.T
duplicate_rows_df_t
c_dupl = duplicate_rows_df_t.corr()
c_dupl
names of duplicate rows: Index(['UGT1A8', 'UGT1A9', 'PANDAR', 'LAP3P2', 'SUGT1P4-STRA6LP', 'STRA6LP',
'LINC00856', 'LINC00595', 'CCL3L3', 'CCL3L1'],
dtype='object')
| | UGT1A8 | UGT1A9 | PANDAR | LAP3P2 | SUGT1P4-STRA6LP | STRA6LP | LINC00856 | LINC00595 | CCL3L3 | CCL3L1 |
|---|---|---|---|---|---|---|---|---|---|---|
| UGT1A8 | 1.000000 | 1.000000 | 0.001888 | 0.001888 | 0.094395 | 0.094395 | -0.023056 | -0.023056 | -0.030130 | -0.030130 |
| UGT1A9 | 1.000000 | 1.000000 | 0.001888 | 0.001888 | 0.094395 | 0.094395 | -0.023056 | -0.023056 | -0.030130 | -0.030130 |
| PANDAR | 0.001888 | 0.001888 | 1.000000 | 1.000000 | 0.057092 | 0.057092 | -0.034693 | -0.034693 | 0.003298 | 0.003298 |
| LAP3P2 | 0.001888 | 0.001888 | 1.000000 | 1.000000 | 0.057092 | 0.057092 | -0.034693 | -0.034693 | 0.003298 | 0.003298 |
| SUGT1P4-STRA6LP | 0.094395 | 0.094395 | 0.057092 | 0.057092 | 1.000000 | 1.000000 | -0.053179 | -0.053179 | 0.081243 | 0.081243 |
| STRA6LP | 0.094395 | 0.094395 | 0.057092 | 0.057092 | 1.000000 | 1.000000 | -0.053179 | -0.053179 | 0.081243 | 0.081243 |
| LINC00856 | -0.023056 | -0.023056 | -0.034693 | -0.034693 | -0.053179 | -0.053179 | 1.000000 | 1.000000 | -0.019943 | -0.019943 |
| LINC00595 | -0.023056 | -0.023056 | -0.034693 | -0.034693 | -0.053179 | -0.053179 | 1.000000 | 1.000000 | -0.019943 | -0.019943 |
| CCL3L3 | -0.030130 | -0.030130 | 0.003298 | 0.003298 | 0.081243 | 0.081243 | -0.019943 | -0.019943 | 1.000000 | 1.000000 |
| CCL3L1 | -0.030130 | -0.030130 | 0.003298 | 0.003298 | 0.081243 | 0.081243 | -0.019943 | -0.019943 | 1.000000 | 1.000000 |
sns.pairplot(duplicate_rows_df_t)
duplicate_rows_df_t.describe()
| | UGT1A8 | UGT1A9 | PANDAR | LAP3P2 | SUGT1P4-STRA6LP | STRA6LP | LINC00856 | LINC00595 | CCL3L3 | CCL3L1 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 |
| mean | 0.137891 | 0.137891 | 0.073107 | 0.073107 | 0.586322 | 0.586322 | 0.032859 | 0.032859 | 0.093464 | 0.093464 |
| std | 0.739790 | 0.739790 | 0.260653 | 0.260653 | 1.363790 | 1.363790 | 0.266343 | 0.266343 | 0.579702 | 0.579702 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 6.303781 | 6.303781 | 1.000000 | 1.000000 | 6.129283 | 6.129283 | 3.000000 | 3.000000 | 6.066089 | 6.066089 |
df_trans_sp.count()
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 18714
...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam 18714
Length: 383, dtype: int64
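df_noDup = df_trans_sp.drop_duplicates()
# Note: this counts the original frame; df_noDup.count() would show the effect of dropping duplicates
df_trans_sp.count()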
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam 18714
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 18714
...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam 18714
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam 18714
Length: 383, dtype: int64
Data structure¶
plt.figure(figsize=(10,5))
#df_small = df.iloc[:, :50]
#c= df_small.corr()
c = df_trans_sp.corr()
# Midpoint of the correlation range (not the mean correlation)
midpoint = (c.values.max() - c.values.min()) / 2 + c.values.min()
#sns.heatmap(c,cmap='coolwarm',annot=True, center=midpoint )
sns.heatmap(c, cmap='coolwarm', center=0)
print("Number of cells included: ", np.shape(c))
print("Midpoint of the correlation range between cells: ", midpoint)
print("Min. correlation of expression profiles between cells: ", c.values.min())
Number of cells included: (383, 383)
Midpoint of the correlation range between cells: 0.4952313892879932
Min. correlation of expression profiles between cells: -0.009537221424013665
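The value stored in `midpoint` is the midpoint between the smallest and largest entries of the correlation matrix, which generally differs from the mean pairwise correlation. A minimal sketch of how the mean over distinct cell pairs could be computed is shown below, on a toy matrix: the data and the names `toy` and `mean_pairwise` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy expression matrix (genes x cells): hypothetical data for illustration.
toy = pd.DataFrame(rng.poisson(5, size=(200, 6)),
                   columns=[f"cell_{i}" for i in range(6)])

c = toy.corr()
values = c.to_numpy()

# Midpoint of the range, computed as in the cell above.
midpoint = (values.max() - values.min()) / 2 + values.min()

# Mean over the strict upper triangle: each cell pair once, diagonal excluded.
upper = np.triu_indices_from(values, k=1)
mean_pairwise = values[upper].mean()

print(midpoint, mean_pairwise)
```

Excluding the diagonal matters because every cell correlates perfectly with itself, which would otherwise inflate the average.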
df_trans_sp = rescaling(df_smart_mcf_filtered_sp)
df_small = df_trans_sp.iloc[:, 10:30]
sns.pairplot(df_small)